Speech-Based Speaker Recognition Systems and Methods

ABSTRACT

The illustrative embodiments described herein provide systems and methods for authenticating a speaker. In one embodiment, a method includes receiving reference speech input including a reference passphrase to form a reference recording, and receiving test speech input including a test passphrase to form a test recording. The method includes determining whether the test passphrase matches the reference passphrase, and determining whether one or more voice features of the speaker of the test passphrase matches one or more voice features of the speaker of the reference passphrase. The method authenticates the speaker of the test speech input in response to determining that the reference passphrase matches the test passphrase and that one or more voice features of the speaker of the test passphrase matches one or more voice features of the speaker of the reference passphrase.

TECHNICAL FIELD OF THE INVENTION

The illustrative embodiments relate generally to speech recognition, and more particularly, to identifying, or authenticating, a speaker using speech-based speaker recognition systems and methods.

BACKGROUND OF THE INVENTION

Speech and voice recognition technologies have found increased usage in many and varied applications as the technology underlying speech recognition has become more advanced. For example, speech recognition technology is used in speech-to-text applications, telephonic interactive voice response (IVR) applications, speech command applications, etc. One potential application involves the use of speech recognition technology to authenticate the identity of a person, or speaker, using his or her speech, including the content of his or her speech.

Current speaker authentication systems may suffer from serious deficiencies, such as unacceptably low accuracy when attempting to identify a speaker based on his or her speech. Such deficiencies can yield devastating results if these systems are used in dangerous environments, such as the authentication of prisoners in a prison for the purpose of determining whether to provide the prisoner with a particular service. The deficiencies in current systems can also adversely affect the service provided by businesses that rely on speech recognition technology to authenticate the customers, or other individuals, associated with their business. Current systems may also lack customizable settings including, but not limited to, the ability to adjust the stringency with which a speaker is authenticated. Due to the lack of customizable settings, current systems may fail to be versatile enough for use in varied environments.

SUMMARY OF THE INVENTION

According to an illustrative embodiment, a method for authenticating a speaker includes receiving reference speech input including a reference passphrase to form a reference recording, and determining a reference set of feature vectors for the reference recording. The reference set of feature vectors has a time dimension. The method also includes receiving test speech input including a test passphrase to form a test recording, and determining a test set of feature vectors for the test recording. The test set of feature vectors has the time dimension. The method also includes correlating the reference set of feature vectors with the test set of feature vectors over the time dimension, and comparing the reference set of feature vectors to the test set of feature vectors to determine whether the test passphrase matches the reference passphrase in response to correlating the reference set of feature vectors with the test set of feature vectors over the time dimension. The method also includes determining a reference fundamental frequency of the reference recording, determining a test fundamental frequency of the test recording, comparing the reference fundamental frequency to the test fundamental frequency to determine whether a speaker of the test speech input matches a speaker of the reference speech input, and authenticating the speaker of the test speech input in response to determining that the reference passphrase matches the test passphrase and that the speaker of the test speech input matches the speaker of the reference speech input.

According to another illustrative embodiment, a speech-based speaker recognition system includes a passphrase recognition module to determine whether a test passphrase spoken as test speech input matches a reference passphrase spoken as reference speech input. The system also includes a voice feature recognition module to determine whether a pitch of a speaker of the test passphrase matches a pitch of a speaker of the reference passphrase. The system also includes a recording storage to store a reference speech recording accessible by the passphrase recognition module and the voice feature recognition module. The reference speech recording includes the reference passphrase.

According to another illustrative embodiment, a method for authenticating a speaker includes receiving reference speech input including a reference passphrase to form a reference recording and determining a reference set of feature vectors for the reference recording. The reference set of feature vectors has a time dimension. The method includes receiving test speech input including a test passphrase to form a test recording and determining a test set of feature vectors for the test recording. The test set of feature vectors has the time dimension. The method includes classifying each frame in the reference set of feature vectors and the test set of feature vectors as one of a voiced frame or a silent frame to form a voiced reference set of feature vectors and a voiced test set of feature vectors, comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine a length ratio, and determining whether the test passphrase is different from the reference passphrase based on the length ratio. The method also includes correlating the voiced reference set of feature vectors with the voiced test set of feature vectors over the time dimension and comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine whether the test passphrase matches the reference passphrase in response to correlating the voiced reference set of feature vectors with the voiced test set of feature vectors over the time dimension. The method includes determining a set of reference fundamental frequency values for the reference recording, determining a set of test fundamental frequency values for the test recording, identifying a set of local peak fundamental frequency values in the set of reference fundamental frequency values and the set of test fundamental frequency values, excluding the set of local peak fundamental frequency values from the set of reference fundamental frequency values and the set of test fundamental frequency values to form a modified set of reference fundamental frequency values and a modified set of test fundamental frequency values, comparing the modified set of reference fundamental frequency values to the modified set of test fundamental frequency values to determine whether a speaker of the test speech input matches a speaker of the reference speech input, and authenticating the speaker of the test speech input in response to determining that the reference passphrase matches the test passphrase and that the speaker of the test speech input matches the speaker of the reference speech input.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic, pictorial representation of a speech-based speaker recognition system according to an illustrative embodiment;

FIG. 2 is a schematic diagram showing the interaction between the elements of the speech-based speaker recognition system in FIG. 1 according to an illustrative embodiment;

FIG. 3 is a schematic, block diagram of a speech-based speaker recognition system according to an illustrative embodiment;

FIG. 4 is a flowchart of a speech-based process for authenticating a speaker according to an illustrative embodiment;

FIG. 5 is a flowchart of a speech-based process for authenticating a speaker according to another illustrative embodiment;

FIG. 6 is a flowchart of a process that utilizes a length ratio to compare a test passphrase to a reference passphrase according to an illustrative embodiment;

FIG. 7 is a flowchart of a process that determines, modifies, and compares reference and test fundamental frequency values according to an illustrative embodiment; and

FIG. 8 is a schematic, block diagram of a data processing system in which the illustrative embodiments may be implemented.

DETAILED DESCRIPTION OF THE INVENTION

Referring to FIGS. 1 and 2, a speech-based speaker recognition system 100 includes a speech-based speaker authenticator 102 that receives test speech input 104 from one or more speakers 106. The test speech input 104 may be received from the speaker 106 via a communication device 108, such as a phone. The test speech input 104 includes a test passphrase 110 that may be compared with a reference passphrase 112 that is part of one or more reference recordings 114. The test passphrase 110 and the reference passphrase 112 may each include one or more words, phonemes, phrases, or any other combination of speech characters. In one non-limiting example, the test passphrase 110 and the reference passphrase 112 may be all or part of the name of the speaker, such as the name of the speaker of the reference passphrase 112 or the name of any person or entity for which authentication is desired. Unless otherwise indicated, as used herein, “or” does not require mutual exclusivity. The reference passphrase 112 may be recorded more than once so that multiple reference recordings 114 may be compared to the test speech input 104.

The reference speech recording may be stored in one or more servers 116 implementing the speech-based speaker authenticator 102. After comparing the test passphrase 110 to the reference passphrase 112, the speech-based speaker authenticator 102 may then determine whether the speaker 106 should be authenticated. The speaker 106 is authenticated if he or she is the same speaker as the speaker of the reference passphrase 112.

With particular reference to FIG. 2, an illustrative embodiment of the interaction between the elements of FIG. 1 is shown in which a speaker, such as the speaker 106, speaks reference speech input containing the reference passphrase 112 to the speech-based speaker authenticator 102 via the communication device 108 (data communication 118). The speech-based speaker authenticator 102 may then store the reference recording 114 that contains the reference passphrase 112 (process 120). The reference recording 114 may be stored, for example, on the server 116 implementing the speech-based speaker authenticator 102.

The reference passphrase 112 may then be used as a standard against which to authenticate any subsequent speakers. Anytime after storing the reference recording 114, the speaker 106 may speak the test speech input 104, which contains the test passphrase 110, to the speech-based speaker authenticator 102 via the communication device 108 (data communication 122). The reference speech input and the test speech input 104 may each be spoken by the same speaker 106, in which case the speaker 106 is authenticated. In another scenario, the speaker 106 of the test speech input 104 may be a different speaker than the speaker of the reference speech input, in which case the speaker 106 is not authenticated.

In one embodiment, the speech-based speaker authenticator 102 uses a two-part authentication process to determine whether the speaker 106 matches the speaker of the reference passphrase 112. The parts of the authentication process may be executed in any order. In one part of the process, the speech-based speaker authenticator 102 may determine whether the test passphrase 110 matches the reference passphrase 112 (process 124). The process 124 focuses primarily on whether the test passphrase 110 is the same as, or substantially similar to, the reference passphrase 112, as opposed to whether one or more voice features of the speaker 106 match one or more voice features of the speaker of the reference passphrase 112. Thus, the process 124 may be considered to be a speaker-independent authentication process. For example, if the reference passphrase 112 is the name of the speaker of the reference passphrase (e.g., John Smith), the process 124 determines whether the test passphrase 110 spoken by the speaker 106 includes all or a portion of the name of the speaker of the reference passphrase. Additional details regarding the process 124 used to determine whether the test passphrase 110 matches the reference passphrase 112 are provided below.

Another part of the authentication process executed by the speech-based speaker authenticator 102 may determine whether a voice feature of the speaker 106 of the test passphrase 110 matches a voice feature of the speaker of the reference passphrase 112 (process 126). The voice feature may be any ascertainable feature of the voice of a speaker, such as pitch, a fundamental frequency estimate, volume, intonation, any mathematical interpretation or representation of the speaker's voice, or other characteristics of the speech frequency spectrum. As opposed to the process 124, which is speaker-independent, the process 126 may be considered speaker-dependent because authentication of the speaker 106 depends upon the particular voice features of the speaker 106 and the speaker of the reference passphrase 112. For example, in the previous example in which the reference passphrase 112 is the name of the speaker of the reference passphrase 112 (e.g., John Smith), the process 126 may compare the pitch of the voice that speaks the reference passphrase 112 with the pitch of the voice of the speaker 106, which speaks the test passphrase 110. In this example, the actual words contained in the reference passphrase 112 and the test passphrase 110 play less of a role than the pitch of the respective voices used to speak the reference passphrase 112 and the test passphrase 110. Additional details regarding the process 126 used to determine whether a voice feature of the speaker 106 matches a voice feature of the speaker of the reference passphrase 112 are provided below.

If the process 124 determines that the test passphrase 110 matches the reference passphrase 112, and the process 126 determines that one or more voice features of the speaker 106 match one or more voice features of the speaker of the reference passphrase 112, then the speech-based speaker authenticator 102 may determine that the speaker 106 is the same person as the speaker of the reference passphrase 112, thereby authenticating the speaker 106. In another embodiment, the speaker 106 may be authenticated if a match is found by any one of the processes 124 or 126.
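The two ways of combining the results of processes 124 and 126 can be illustrated with a minimal sketch. The function name `authenticate` and the `require_both` flag are hypothetical names chosen for illustration; they do not appear in the embodiments.

```python
def authenticate(passphrase_match: bool, voice_match: bool,
                 require_both: bool = True) -> bool:
    """Combine the outcomes of process 124 (passphrase) and process 126 (voice).

    require_both=True models the embodiment in which both processes must
    find a match; require_both=False models the embodiment in which a match
    by either process suffices. Both names are illustrative assumptions.
    """
    if require_both:
        return passphrase_match and voice_match
    return passphrase_match or voice_match
```

Under this sketch, a correct passphrase spoken by a different voice is rejected when `require_both` is true and accepted when it is false, which is one way the stringency of authentication could be adjusted.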

In one embodiment, the speech-based speaker authenticator 102 may send speaker authentication data 128 to an access-protected entity 130 after determining whether to authenticate the speaker 106 (data communication 132). The speaker authentication data 128 includes data regarding whether the speaker 106 was authenticated by the speech-based speaker authenticator 102. The access-protected entity 130 may be any entity or service to which access depends upon whether the speaker 106 has been authenticated. Also, the speech-based speaker authenticator 102 may be part of the access-protected entity 130, and may be located on or off the premises of the access-protected entity 130. In another embodiment, the speech-based speaker authenticator 102 is administered by, or associated with, an entity or person that is at least partially separate from the access-protected entity 130, such as an authentication service.

By way of non-limiting example, the access-protected entity 130 may be a prison that conditions a service, such as the placement of phone calls by its prisoners, on authenticating the person attempting to place the phone call. In this example, a prisoner in the prison may provide the reference passphrase 112, such as the prisoner's name, which is recorded and stored as the reference recording 114. The prisoner that records the reference passphrase 112 may be associated with an individual account that grants and denies the prisoner certain calling permissions, such as the ability or inability to call certain persons. The calling permissions granted or denied to the prisoner may depend on the prisoner's circumstances, including any restraining orders applicable to the prisoner, or witnesses or lawyers associated with the prisoner. The account associated with the prisoner may also have certain attributes, such as an amount of money with which to place phone calls. Thereafter, any person wishing to place a phone call under the prisoner's account must speak a test passphrase 110 that matches the reference passphrase 112 to the speech-based speaker authenticator 102 so that the speech-based speaker authenticator 102 can verify that the speaker 106 wishing to place the phone call is, in fact, the same person as the prisoner who recorded the reference passphrase 112. The speech-based speaker authenticator 102 may also prevent a prisoner from accessing an account other than his or her own, which may be useful in preventing the prisoner from placing a phone call that would be prohibited by the prisoner's own account, such as a threatening phone call to the victim of his or her crime.

In addition to the non-limiting example given above, the speech-based speaker authenticator 102 may be used in a wide variety of environments in which speaker authentication is advantageous. For example, the access-protected entity 130 may be a business that wishes to prevent unauthorized access to the business's customer accounts. In this case, the customer or potential customer may be asked to provide a passphrase, such as his or her name or other password, in order to access his or her account. Each account may be customized for each customer. If or when the speech-based speaker authenticator 102 authenticates the speaker 106, the speaker will be allowed access to his or her customer account, including any privileges, restrictions, or attributes associated therewith.

The communication device 108 may be any device capable of receiving and transmitting speech. Non-limiting examples of the communication device 108 include landline phones, Voice Over Internet Protocol (VOIP) phones, cellular phones, smart phones, walkie talkies, computers (e.g., desktops, laptops, netbooks, and minicomputers), personal digital assistants, digital music players, digital readers, portable gaming devices, web browsing devices, media players, etc. Although the possible devices represented by the communication device 108 are numerous, in the non-limiting example of FIG. 1, the communication device 108 is a phone.

The techniques, technologies, or media by which the components of the speech-based speaker recognition system 100 intercommunicate are numerous. For example, the speech-based speaker recognition system 100, or any portion thereof, may be part of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), or any other network type. Data communication medium 134 between the access-protected entity 130 and the speech-based speaker authenticator 102 may be any medium through which data can be communicated. For example, the data communication medium 134 may be wired or wireless data connections, and may utilize a virtual private network (VPN), multi-protocol label switching (MPLS), the Internet, or any other data communication media.

The data communication medium 136 between the speech-based speaker authenticator 102 and the communication device 108 may be of the same or similar type as any of the non-limiting examples provided for the data communication medium 134. In addition to the server 116 on which the speech-based speaker authenticator 102 may be implemented, additional intervening servers may facilitate data communication or storage within the speech-based speaker recognition system 100. Communication between the communication device 108 and the speech-based speaker authenticator 102 may also be via wireless communication. The wireless communication may be facilitated by an intervening base station (not shown). Wireless communication between the communication device 108 and the speech-based speaker authenticator 102 may utilize any wireless standard for communicating data, such as CDMA (e.g., cdmaOne or CDMA2000), GSM, 3G, 4G, Edge, an over-the-air network, Bluetooth, etc.

In one example, the speech-based speaker recognition system 100 may utilize the Internet, with any combination of the data communication media 134, 136 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages. FIG. 1 is intended as an example, and not as an architectural limitation for the different illustrative embodiments.

Referring to FIG. 3, an illustrative embodiment of the speech-based speaker recognition system 200 includes the speech-based speaker authenticator 202, which includes a variety of modules and other elements. Components of FIG. 3 that are analogous to components in FIGS. 1 and 2 have been shown by indexing the reference numerals by 100. As described above, a speaker, such as the speaker 206, may speak reference speech input 238 to the speech-based speaker authenticator 202 via the communication device 208 to form the reference recording 214. The reference recording 214 may be stored in a recording storage 240, which may be implemented in any storage device, such as a hard drive, a memory, a cache, or any other device capable of storing data. The reference passphrase 212 may then be used to verify the identity of any subsequent speaker, which, in the example of FIG. 3, is the speaker 206. The recording storage 240 may also, in one embodiment, store profiles or accounts associated with a speaker of the reference passphrase 212, such as a prisoner account, a customer account, or any other type of account.

The speech-based speaker authenticator 202 includes a passphrase recognition module 242 that determines whether the test passphrase 210, which is spoken as test speech input 204 by the speaker 206, matches the reference passphrase 212. The test speech input 204, as well as the test passphrase 210, may be stored on the recording storage 240 as a test recording. In one embodiment, the passphrase recognition module 242 is a speaker-independent authentication module that seeks to determine a similarity between the test passphrase 210 and the reference passphrase 212 without regard to the speaker from which each is spoken. Numerous methods or techniques may be used to determine whether the test passphrase 210 matches the reference passphrase 212. Examples of such methods may include Hidden Markov models, pattern matching algorithms, neural networks, and decision trees.

In one embodiment, the passphrase recognition module 242 employs a feature vector module 244 and a dynamic time warping module 246 to determine whether the test passphrase 210 matches the reference passphrase 212. In this embodiment, the feature vector module 244 may convert each of the test passphrase 210 and the reference passphrase 212 into a test set of feature vectors and a reference set of feature vectors, respectively, each of which has a time dimension. As used herein, the term “set” encompasses a quantity of one or more. Afterwards, the dynamic time warping module 246 may correlate, or align, the reference set of feature vectors with the test set of feature vectors over the time dimension, such as by using dynamic time warping. After correlating the feature vector sets, a passphrase comparison engine 248 may compare the test set of feature vectors to the reference set of feature vectors to determine their similarity to one another, and therefore whether the test passphrase 210 matches the reference passphrase 212.

In one embodiment, prior to converting the test passphrase 210 and the reference passphrase 212 into a test set of feature vectors and a reference set of feature vectors, respectively, the feature vector module 244 may pre-process each speech signal in the time domain by applying leading and trailing background noise reduction to both the test passphrase 210 and the reference passphrase 212. For example, this noise reduction pre-process step may use a power subtraction method, power reduction of a background noise, or other process as described in “Multi-Stage Spectral Subtraction for Enhancement of Audio Signals”, IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume 2, pp. II-805-808, May 2004 by Masatsugu Okazaki, Toshifumi Kunimoto, and Takao Kobayashi, which is hereby incorporated by reference in its entirety.

In one embodiment, the feature vector module 244, in the process of converting the test passphrase 210 and the reference passphrase 212 into feature vector sets, places the test passphrase 210 and the reference passphrase 212 in the cepstral domain. The speech signal associated with the test passphrase 210 and the reference passphrase 212 may be sampled by an analog-to-digital converter to form frames of digital values. A Discrete Fourier Transform is applied to the frames of digital values to place them in the frequency domain. The power spectrum is computed from the frequency domain values by taking the magnitude squared of the spectrum. Mel weighting is applied to the power spectrum and the logarithm of each of the weighted frequency components is determined. A truncated discrete cosine transform is then applied to form a cepstral vector for each frame. The truncated discrete cosine transform may convert a forty-dimension vector that is present after the log function into a thirteen-dimension cepstral vector. A thirteen-dimension cepstral vector may be generated for each of the test passphrase 210 and the reference passphrase 212. The thirteen-dimension cepstral vectors may then be aligned by the dynamic time warping module 246, and compared to one another by the passphrase comparison engine 248.

In another embodiment, the test passphrase 210 and the reference passphrase 212 may each be digital recordings that have an original sampling rate. The test passphrase 210 and the reference passphrase 212 may also be converted into digital format from another format, such as analog. The digital recording containing the test passphrase 210 or the reference passphrase 212 may then be converted from the original sampling rate to a conversion sampling rate. In one example, the digital recording containing the test passphrase 210 or the reference passphrase 212 is converted to a 16-bit, 16 Kilohertz linear pulse code modulation format.

Thirteen-dimension Mel Cepstrum feature vectors may then be calculated for each 25 millisecond window of speech signal with a 10 millisecond frame rate using a Discrete Fourier Transform and one or more elements or processes in the “SPHINX III Signal Processing Front End Specification”, Carnegie Mellon University Speech Group, Aug. 31, 1999 by Michael Seltzer, which is hereby incorporated by reference in its entirety. A set of front end processing parameters, each of which may have predetermined or customized values based on the embodiment, may be used by the feature vector module 244 in the feature vector conversion process. In one embodiment, the front end processing, or default, parameters may have the following values:

Sampling Rate: 16000.0 Hertz

Frame Rate: 100 Frames/Sec

Window Length: 0.025625 Sec

Filterbank Type: Mel Filterbank

Number of Cepstra: 13

Number of Mel Filters: 40

Discrete Fourier Transform Size: 512

Lower Filter Frequency: 133.33334 Hertz

Upper Filter Frequency: 6855.4976 Hertz

Pre-Emphasis α: 0.0

In one embodiment, the 13 Cepstra may include 12 cepstral (spectral) values and one (1st) value measuring the signal energy (or power).
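As a minimal sketch, the default values listed above could be gathered into a single configuration object from which the per-frame sample counts follow. The class name `FrontEndParams` and all field and property names are hypothetical; only the numeric defaults come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontEndParams:
    """Hypothetical container for the default front end parameters."""
    sampling_rate_hz: float = 16000.0
    frame_rate_fps: int = 100
    window_length_s: float = 0.025625
    num_cepstra: int = 13
    num_mel_filters: int = 40
    dft_size: int = 512
    lower_filter_hz: float = 133.33334
    upper_filter_hz: float = 6855.4976
    pre_emphasis_alpha: float = 0.0

    @property
    def window_samples(self) -> int:
        # Samples covered by one analysis window.
        return round(self.sampling_rate_hz * self.window_length_s)

    @property
    def frame_shift_samples(self) -> int:
        # Samples between the starts of consecutive frames.
        return round(self.sampling_rate_hz / self.frame_rate_fps)
```

With these defaults, one window spans 410 samples and consecutive frames start 160 samples (10 milliseconds) apart, so each 410-sample window fits within the 512-point Discrete Fourier Transform after zero padding.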

The feature vector module 244 may apply a Finite Impulse Response (FIR) pre-emphasis filter, such as the one below, to the input waveform that corresponds to the test passphrase 210 or the reference passphrase 212:

$y[n] = x[n] - \alpha\, x[n-1]$

α may be user-defined or have the default value. This step may be skipped if α=0. A subsequent round of processing may use the appropriate sample of the input stored as a history value. In one embodiment, the pre-emphasis filter may utilize any filter, including an FIR, which allows the filtering out of a part of a frequency spectrum, as described in “Theory and Application of Digital Signal Processing”, Prentice Hall, Inc.: Englewood Cliffs, N.J., 1975 by Lawrence R. Rabiner and Bernard Gold, which is herein incorporated by reference in its entirety.
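The first-order pre-emphasis filter can be sketched as follows. The function name `pre_emphasize` and the `history` argument (the last input sample of the previous round of processing) are illustrative assumptions, not names taken from the embodiments.

```python
def pre_emphasize(samples, alpha=0.0, history=0.0):
    """Apply y[n] = x[n] - alpha * x[n-1] to a block of samples.

    `history` supplies x[-1], i.e. the last input sample of the previous
    block, so consecutive blocks can be filtered without a discontinuity.
    """
    if alpha == 0.0:
        # The step may be skipped entirely when alpha is zero.
        return list(samples)
    out = []
    prev = history
    for x in samples:
        out.append(x - alpha * prev)
        prev = x
    return out
```

For instance, `pre_emphasize([1.0, 2.0, 3.0], alpha=0.5)` yields `[1.0, 1.5, 2.0]`, since each output subtracts half of the preceding input sample.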

Next, a windowing process, a power spectrum process, a mel spectrum process, and a Mel Cepstrum process may be performed by the feature vector module 244 on a frame basis. In the windowing process, the feature vector module 244 may multiply the frame by a Hamming window, such as the following:

${w\lbrack n\rbrack} = {0.54 - {0.46{\cos \left( \frac{2\pi \; n}{N - 1} \right)}}}$

wherein N is the length of the frame.
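A minimal sketch of the windowing step, using only the standard library, is given below. The function names `hamming_window` and `apply_window` are hypothetical, and the single-sample case is handled by an added guard that the equation itself does not address.

```python
import math


def hamming_window(n_samples):
    """Return w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) for n = 0..N-1."""
    if n_samples == 1:
        return [1.0]  # assumption: avoid dividing by N-1 = 0
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame):
    """Multiply a frame sample-by-sample by a Hamming window of equal length."""
    window = hamming_window(len(frame))
    return [x * w for x, w in zip(frame, window)]
```

The window tapers the frame toward 0.08 at both ends and leaves the center near 1.0, which reduces spectral leakage in the Discrete Fourier Transform that follows.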

In the power spectrum process, the feature vector module 244 may determine the power spectrum of the frame by performing a Discrete Fourier Transform of length specified by the user, and then computing its magnitude squared. For example, the power spectrum process may employ the following equation:

S[k]=(real(X[k]))²+(imag(X[k]))²
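The power spectrum computation can be sketched with a direct (non-fast) Discrete Fourier Transform. The function name `power_spectrum` is hypothetical, zero padding the frame up to the transform length is an assumption, and a practical implementation would normally use a fast Fourier transform instead of this quadratic-time loop.

```python
import cmath


def power_spectrum(frame, dft_size):
    """Return S[k] = real(X[k])^2 + imag(X[k])^2 for k = 0..dft_size/2.

    The frame is zero padded to dft_size (an assumption). Only the
    non-redundant half of the spectrum of a real signal is returned.
    """
    if len(frame) > dft_size:
        raise ValueError("frame is longer than the DFT size")
    padded = list(frame) + [0.0] * (dft_size - len(frame))
    spectrum = []
    for k in range(dft_size // 2 + 1):
        x_k = sum(x * cmath.exp(-2j * cmath.pi * k * n / dft_size)
                  for n, x in enumerate(padded))
        spectrum.append(x_k.real ** 2 + x_k.imag ** 2)
    return spectrum
```

A constant frame of four ones with a four-point transform puts all of its energy in bin 0 (value 16) and none in the other bins.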

In the mel spectrum process, the feature vector module 244 may determine a mel spectrum of the power spectrum computed above by multiplying the power spectrum by triangular mel weighting filters and integrating the result. The following equation may be employed by the mel spectrum process:

$\tilde{S}[l] = \sum_{k=0}^{N/2} S[k]\, M_l[k], \quad l = 0, 1, \ldots, L-1$

In this equation, N is the length of the Discrete Fourier Transform, and L is a total number of triangular mel weighting filters. Regarding the triangular mel weighting factors, the mel scale filterbank is a series of L triangular bandpass filters, which corresponds to a series of bandpass filters with constant bandwidth and spacing on a mel frequency scale. When using a linear frequency scale, this filter spacing is approximately linear up to 1 Kilohertz, and becomes logarithmic at higher frequencies. The following warping function may be used to transform linear frequencies to mel frequencies:

${{mel}(f)} = {2595{\log \left( {1 + \frac{f}{700}} \right)}}$

With regard to a plot of this warping function, a series of L triangular filters with 50% overlap may be constructed such that they are equally spaced on the mel scale spanning [mel(f_min), mel(f_max)]. f_min and f_max may be user-defined or set to the default values.
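The construction of the triangular filters and the resulting mel spectrum can be sketched as follows. This sketch assumes a base-10 logarithm in the warping function and unit-height triangles without area normalization; the function names are hypothetical, and the referenced front end specification may weight the filters differently.

```python
import math


def hz_to_mel(f):
    # Warping function, assuming a base-10 logarithm.
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, dft_size, sampling_rate, f_min, f_max):
    """Build L triangular filters equally spaced on the mel scale.

    Returns a list of L weight lists, each of length dft_size/2 + 1.
    Adjacent triangles share edges, giving 50% overlap on the mel scale.
    """
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_hi - mel_lo) / (num_filters + 1)
    edges_hz = [mel_to_hz(mel_lo + i * step) for i in range(num_filters + 2)]
    bin_hz = sampling_rate / dft_size
    bank = []
    for l in range(num_filters):
        left, center, right = edges_hz[l], edges_hz[l + 1], edges_hz[l + 2]
        weights = []
        for k in range(dft_size // 2 + 1):
            f = k * bin_hz
            if left < f <= center:
                weights.append((f - left) / (center - left))
            elif center < f < right:
                weights.append((right - f) / (right - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mel_spectrum(power, bank):
    """Weight the power spectrum S[k] by each filter M_l[k] and sum over k."""
    return [sum(s * m for s, m in zip(power, weights)) for weights in bank]
```

With the default parameters (40 filters, a 512-point transform, 16000 Hertz sampling, and the listed lower and upper filter frequencies), the filterbank has 40 rows of 257 weights each.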

In the mel cepstrum process, the feature vector module 244 may apply a Discrete Cosine Transform to the natural logarithm of the mel spectrum, calculated in the mel spectrum process, to obtain the mel cepstrum:

${c\lbrack n\rbrack} = {\sum\limits_{i = 0}^{L - 1}{\ln \left( {\overset{\sim}{S}\lbrack i\rbrack} \right){\cos \left( {\frac{\pi \; n}{2L}\left( {{2i} + 1} \right)} \right)}}}$c = 0, 1, …  , C − 1

C is the number of cepstral coefficients, which may be outputted by the process, and the cepstral coefficients may be 32-bit floating point data. In one embodiment, the resulting sequence of thirteen-dimension feature vectors for each 25 milliseconds of digitized speech samples (25 millisecond frames) with a 10 millisecond frame rate may be stored as a reference set of feature vectors and a test set of feature vectors for the reference recording 214 and the test recording, respectively.
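The mel cepstrum step, a cosine transform of the log mel spectrum truncated to C coefficients, can be sketched as below. The function name `mel_cepstrum` and the small floor applied before the logarithm are assumptions added so the sketch does not fail on an all-zero filter output.

```python
import math


def mel_cepstrum(mel_spec, num_cepstra=13, floor=1e-10):
    """Return c[n] = sum_i ln(S~[i]) * cos(pi*n/(2L) * (2i+1)), n = 0..C-1.

    `floor` is an assumed guard against taking the logarithm of zero.
    """
    num_filters = len(mel_spec)
    log_spec = [math.log(max(s, floor)) for s in mel_spec]
    return [
        sum(log_spec[i] * math.cos(math.pi * n / (2 * num_filters) * (2 * i + 1))
            for i in range(num_filters))
        for n in range(num_cepstra)
    ]
```

For a flat mel spectrum, only the first coefficient is nonzero, which reflects that the first coefficient tracks overall log energy while the remaining coefficients describe spectral shape.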

Once a test set of feature vectors and a reference set of feature vectors are obtained for the test passphrase 210 and the reference passphrase 212, respectively, these feature vector sets may be correlated, aligned, or warped with respect to one another along a time dimension so that the passphrase recognition module 242 can better determine similarities between the test passphrase 210 and the reference passphrase 212. A process called dynamic time warping may be used to correlate the test set of feature vectors and the reference set of feature vectors with one another. Dynamic time warping may be used to measure the similarity between two sequences which vary in time or speed. Dynamic time warping helps to find an optimal match between two given sequences (e.g., feature vectors that correspond to the test passphrase 210 and the reference passphrase 212) with certain restrictions. In one application of dynamic time warping, the sequences may be “warped” non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. Dynamic time warping can help to explain variability in the Y-axis by warping the X-axis.

In one embodiment, the reference set of feature vectors and the test set of feature vectors may be represented by two time series Q and C having respective lengths of n and m:

$Q = q_{1}, q_{2}, \ldots, q_{i}, \ldots, q_{n}$

$C = c_{1}, c_{2}, \ldots, c_{j}, \ldots, c_{m}$

In one non-limiting example, each feature vector may represent approximately 10 milliseconds of speech data (with a rate of 100 frames per second).

To correlate, or align, the two sequences using dynamic time warping, the dynamic time warping module 246 may construct an n-by-m matrix, where the (i^(th), j^(th)) element of the matrix contains the distance d(q_(i),c_(j)) between the two points q_(i) and c_(j). A Euclidean distance may be used, such that d(q_(i),c_(j))=(q_(i)−c_(j))². Each of the matrix elements (i,j) corresponds to the alignment between the points q_(i) and c_(j). The dynamic time warping module 246 may then determine a warping path W. The warping path W is a contiguous set of matrix elements that defines a mapping between Q and C. When the k^(th) element of W is defined as w_(k)=(i,j)_(k), the following relation may be used:

$W = w_{1}, w_{2}, \ldots, w_{k}, \ldots, w_{K}$

$\max(m, n) \leq K < m + n - 1$

In one embodiment, the dynamic time warping module 246 may subject the warping path W to one or more constraints. For example, the dynamic time warping module 246 may require the warping path W to start and finish in diagonally opposite corner cells of the matrix. Such a boundary constraint may be expressed as w₁=(1,1) and w_(K)=(m,n). The dynamic time warping module 246 may also restrict the allowable steps in the warping path W to adjacent cells, including diagonally adjacent cells. Such a continuity constraint may be expressed as:

Given $w_{k} = (a, b)$, then $w_{k-1} = (a', b')$

where $a - a' \leq 1$ and $b - b' \leq 1$

The dynamic time warping module 246 may also force the points in the warping path W to be monotonically spaced in time. Such a monotonicity constraint may be expressed as:

Given $w_{k} = (a, b)$, then $w_{k-1} = (a', b')$

where $a - a' \geq 0$ and $b - b' \geq 0$

In one embodiment, the dynamic time warping module 246 may use the following equation to minimize the warping cost when determining the warping path W:

${{DTW}\left( {Q,C} \right)} = {\min \left\{ \frac{\sqrt{\sum_{k = 1}^{K}w_{k}}}{K} \right.}$

The K in the denominator may help to compensate for warping paths having different lengths.

The dynamic time warping module 246 may find the warping path with a minimized warping cost by using dynamic programming to evaluate the following recurrence, which defines the cumulative distance γ(i,j) as the sum of the distance d(i,j) found in the current cell and the minimum of the cumulative distances of the adjacent elements:

$\gamma(i, j) = d(q_{i}, c_{j}) + \min\left\{ \gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1) \right\}$

Various methods may be used by the dynamic time warping module 246 to address the problem of singularities, including windowing, slope weighting, and step patterns (slope constraints).
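The recurrence and the path constraints can be illustrated with a short sketch. It makes several simplifying assumptions that are not taken from the description: the sequences hold scalar values rather than thirteen-dimensional feature vectors, no windowing or slope constraint is applied, and the function names are hypothetical. The normalized cost is computed on the backtracked minimum-cumulative-distance path.

```python
import math

def dtw_cumulative(q, c):
    """Fill gamma(i, j) for 1-based i, j; row 0 and column 0 are padding."""
    n, m = len(q), len(c)
    inf = float("inf")
    gamma = [[inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2  # d(q_i, c_j)
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return gamma

def warping_path(gamma):
    """Backtrack from the last cell to (1, 1) through adjacent cells."""
    i, j = len(gamma) - 1, len(gamma[0]) - 1
    path = [(i, j)]
    while (i, j) != (1, 1):
        _, i, j = min((gamma[i - 1][j - 1], i - 1, j - 1),
                      (gamma[i - 1][j], i - 1, j),
                      (gamma[i][j - 1], i, j - 1))
        path.append((i, j))
    path.reverse()
    return path

def dtw_cost(q, c):
    """Square root of the cumulative distance, normalized by path length K."""
    gamma = dtw_cumulative(q, c)
    return math.sqrt(gamma[-1][-1]) / len(warping_path(gamma))
```

Because each backtracking step decreases i, j, or both by at most one, the recovered path satisfies the boundary, continuity, and monotonicity constraints by construction.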

In another embodiment, the dynamic time warping module 246 may use one or more elements of a derivative dynamic time warping process. Derivative dynamic time warping may be useful when two sequences differ in the Y-axis in addition to local accelerations and decelerations in the time axis. In one example, the dynamic time warping module 246 may use one or more elements or processes of the derivative dynamic time warping described in “Derivative Dynamic Time Warping”, First SIAM International Conference on Data Mining (SDM'2001), 2001, Chicago, Ill., USA by Eamonn J. Keogh and Michael J. Pazzani, which is hereby incorporated by reference in its entirety.

Derivative dynamic time warping differs from some other types of dynamic time warping, such as the dynamic time warping example given above, in that derivative dynamic time warping does not consider only the Y-values of the data points for which a correlation is sought, but rather considers the higher-level features of “shape”. Information about shape is obtained using the first derivative of the sequences.

The dynamic time warping module 246 may generate an n-by-m matrix wherein the (i^(th), j^(th)) element of the matrix contains the distance d(q_(i),c_(j)) between the two points q_(i) and c_(j). In contrast to the dynamic time warping example given above, the distance measure d(q_(i),c_(j)) is not Euclidean, but rather the square of the difference of the estimated derivatives of q_(i) and c_(j). The following estimate may be used to obtain the derivative:

${D_{x}\lbrack q\rbrack} = {{\frac{\left( {q_{i} - q_{i - 1}} \right) + \left( {\left( {q_{i + 1} - q_{i - 1}} \right)/2} \right)}{2}\mspace{14mu} 1} < i < m}$

This estimate is the average of the slope of the line through the point in question and its left neighbor, and the slope of the line through the left neighbor and the right neighbor. The dynamic time warping module 246 may use exponential smoothing before attempting to estimate the derivatives, especially for noisy datasets. The distance measurement calculated by using the above derivative estimate may then be used by dynamic time warping processes, including the dynamic time warping process described in the previous examples.
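The derivative estimate and the resulting distance measure can be sketched as follows. The function names are hypothetical, the sequence is assumed to hold scalar values, and the first and last points are simply skipped because the estimate is defined only for interior points; exponential smoothing is omitted.

```python
def estimate_derivative(q):
    """Derivative estimate for each interior point of q."""
    return [((q[i] - q[i - 1]) + (q[i + 1] - q[i - 1]) / 2.0) / 2.0
            for i in range(1, len(q) - 1)]

def ddtw_distance(dq_i, dc_j):
    """Squared difference of two estimated derivatives."""
    return (dq_i - dc_j) ** 2
```

For a straight line such as `[0, 1, 2, 3, 4]` both slopes equal 1 at every interior point, so the estimate is 1.0 throughout.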

In one embodiment, prior to applying the Derivative Dynamic Time Warping method described above, all frames in reference and test passphrases 212, 210, and in particular the reference and test sets of feature vectors, are classified as voiced or silent frames, based on the energy of each frame. In one example, the energy, or power, may be one of the values in the 13 Cepstra discussed above, such as the first dimension value of the 13 Cepstra. An energy threshold may be used to classify a given frame as voiced or silent, and the energy threshold may be configured as a function of the average energy level. For example, each frame in the reference and test sets of feature vectors may be compared to the energy threshold such that frames having an energy level that exceeds the energy threshold (e.g., the average energy level) are classified as voiced frames, while frames having an energy level that is less than the energy threshold are classified as silent frames.

For purposes of classifying the frames as voiced or silent, the test and reference passphrases 210, 212 may be assumed to be similar or identical. Thus, the voiced frames of each of the reference set of feature vectors and test set of feature vectors should be somewhat similar, but not necessarily identical, in length when the test passphrase 210 is the same as the reference passphrase 212. Using this assumption, the passphrase comparison engine 248 may compare the voiced reference set of feature vectors to the voiced test set of feature vectors to determine whether the test passphrase 210 matches the reference passphrase 212. In one particular embodiment, the passphrase comparison engine 248 may determine a length ratio that is the ratio of the length of the voiced reference set of feature vectors to the length of the voiced test set of feature vectors. The test passphrase 210 may be determined to match the reference passphrase 212 if the length ratio is within a predetermined ratio, such as 1:1.1, 1:1.5, or any other ratio. On the other hand, the passphrase comparison engine 248 may declare a mismatch between the test passphrase 210 and the reference passphrase 212 if the length ratio exceeds a predetermined ratio. In this manner, the length ratio may be used to guard against attempts to find a match, or actual match determinations, between reference and test feature sets of grossly, or otherwise user-intolerably, different lengths. In addition, the length ratio metric may be provided as a configurable input parameter. At any time after classifying the frames as voiced or silent, the dynamic time warping module 246 applies the derivative dynamic time warping method only to the sequences of voiced feature sets in reference and test passphrases 212, 210.
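The voiced/silent classification and the length-ratio guard can be sketched as below. The names and defaults are hypothetical: the energy is assumed to sit at index 0 of each feature vector, the threshold is taken to be the average energy, and the ratio is formed as longer over shorter so that a single bound such as 1.5 covers both directions.

```python
def voiced_frames(frames, energy_index=0):
    """Keep frames whose energy exceeds the average energy of the sequence."""
    energies = [frame[energy_index] for frame in frames]
    threshold = sum(energies) / len(energies)
    return [frame for frame, e in zip(frames, energies) if e > threshold]

def lengths_compatible(ref_voiced, test_voiced, max_ratio=1.5):
    """True when the voiced lengths differ by no more than max_ratio."""
    longer = max(len(ref_voiced), len(test_voiced))
    shorter = min(len(ref_voiced), len(test_voiced))
    return shorter > 0 and longer / shorter <= max_ratio
```

Only when `lengths_compatible` holds would the voiced sequences be passed on to the warping step; otherwise a mismatch can be declared without further computation.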

After processing of the test passphrase 210 and the reference passphrase 212, such as by the feature vector module 244 and the dynamic time warping module 246, the passphrase comparison engine 248 may then compare the test passphrase 210 to the reference passphrase 212 to determine whether the test passphrase 210 matches the reference passphrase 212. As described above, in one embodiment, the test passphrase 210 and the reference passphrase 212 may each be converted to a set of feature vectors and correlated with respect to one another using dynamic time warping, after which the passphrase comparison engine 248 compares the reference set of feature vectors to the test set of feature vectors to determine whether the reference passphrase 212 matches the test passphrase 210. If the passphrase comparison engine 248 determines that there is a match between the test passphrase 210 and the reference passphrase 212, the passphrase comparison engine 248 may output such determination to another module in the speech-based speaker authenticator 202, and this determination may be used by the speech-based speaker authenticator 202 to determine whether the speaker 206 is the same speaker that spoke the reference passphrase 212.

The passphrase comparison engine 248 may also include a passphrase match scoring module 250, which allows a user to specify one or more thresholds to determine when a “successful” or “failed” match is found by the passphrase comparison engine 248. For example, the passphrase match scoring module 250 may allow a user to “loosen” or “tighten” the stringency with which the reference set of feature vectors is compared to the test set of feature vectors, such that when the comparison standard is loosened, reference and test sets of feature vectors that are relatively dissimilar will be determined to be a match when a match would not have been declared under a more tightened standard.

In one embodiment, each derivative dynamic time warping process, described in further detail above, outputs a floating point value (e.g., 0.8775). This floating point value may be defined as a minimal cumulative distance DTW(Q,C) normalized by K. DTW(Q,C) and K have been defined above. The passphrase match scoring module 250 may further define scoring weights or coefficients that apply to DTW(Q,C) depending on a cumulative length of the test and reference passphrases 210, 212. These scoring weights determine a threshold to be applied to estimate if a match was ‘successful’ or ‘failed’.

The ability to adjust the stringency with which to declare a match between the test passphrase 210 and the reference passphrase 212 provides versatility to the speech-based speaker authenticator 202. For example, in high-security environments, such as a prison, where there is little margin for error, a higher standard may be desired to minimize the risk that the speaker 206 is falsely identified as the speaker of the reference passphrase 212. In environments where security is less important, the speech-based speaker authenticator 202 may be used to loosen the standard of comparison between the test passphrase 210 and the reference passphrase 212 to minimize scenarios in which a failed match occurs even when the test passphrase 210 is the same as the reference passphrase 212.

In another embodiment, the passphrase match scoring module 250 determines a score based on the similarity between the test passphrase 210 and the reference passphrase 212. The passphrase comparison engine 248 may then use the score to determine whether the test passphrase 210 matches the reference passphrase 212. In one embodiment, the score, which, in one example, indicates the similarity between the reference set of feature vectors and the test set of feature vectors, may be compared to a match threshold. Whether the reference set of feature vectors matches the test set of feature vectors, and as a result, whether the reference passphrase 212 matches the test passphrase 210, is based on the comparison between the score and the match threshold. The match threshold may be user-definable to allow the user to adjust the looseness or tightness of the comparison.

By way of non-limiting example, the similarity between a reference set of feature vectors and a test set of feature vectors may be given a score between 0 and 100, where 0 indicates complete dissimilarity and 100 indicates an exact match between the reference and test set of feature vectors. In this example, a user may define a match threshold anywhere from 0 to 100. If the user selects a match threshold of 40, for example, a match between the reference set of feature vectors and the test set of feature vectors will be determined if the score meets or exceeds the match threshold of 40. If the user selects a match threshold of 90, more stringent match criteria will apply, and a match between the reference set of feature vectors and the test set of feature vectors will be found only if the score meets or exceeds 90.

Other types of scoring structures may be employed to allow variability in the match determination conducted by the passphrase comparison engine 248. For example, the passphrase match scoring module 250 may employ two or more reference sets of feature vectors that are converted from two or more respective reference speech inputs 238 containing the same reference passphrase 212. The passphrase match scoring module 250 may compare the test set of feature vectors to the multiple reference sets of feature vectors stored by the recording storage 240. In particular, the passphrase match scoring module 250 may determine a score that corresponds to one of the following scenarios: (1) the test set of feature vectors matches, within a predetermined tolerance, the multiple reference sets of feature vectors, (2) the test set of feature vectors matches, within a predetermined tolerance, any one of the multiple reference sets of feature vectors, or (3) the test set of feature vectors matches, within a predetermined tolerance, any one of the multiple reference sets of feature vectors in addition to an external boundary condition (e.g., a noisy environment or a reference or test speaker known to be speech-impaired). A match may be declared between the test passphrase 210 and the reference passphrase 212 for any one of these scenarios depending on the desired stringency with which to compare the test passphrase 210 to the reference passphrase 212.
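The threshold comparison and the first two multi-reference scenarios can be sketched as follows. The function names and the `policy` parameter are hypothetical, scores are assumed to lie on the 0 to 100 similarity scale of the earlier example, and the third scenario (an added external boundary condition) is left out because the description does not specify how it would be evaluated.

```python
def is_match(score, match_threshold):
    """A match is declared when the score meets or exceeds the threshold."""
    return score >= match_threshold

def multi_reference_match(scores, match_threshold, policy="any"):
    """Combine per-reference scores: 'all' is scenario (1), 'any' is scenario (2)."""
    hits = [is_match(s, match_threshold) for s in scores]
    if policy == "all":
        return all(hits)
    if policy == "any":
        return any(hits)
    raise ValueError("policy must be 'all' or 'any'")
```

With scores of 85, 92, and 70 against three stored references, a threshold of 90 yields a match under the "any" policy but not under the stricter "all" policy.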

The speech-based speaker authenticator 202 also includes a voice feature recognition module 252, which compares one or more voice features of the speaker 206 to one or more voice features of a speaker of the reference passphrase 212. The voice feature recognition module 252 may be considered to be speaker-dependent since the comparison performed by the voice feature recognition module 252 depends on the voice features of the speakers that are compared.

In one embodiment, the voice feature recognition module 252 includes a fundamental frequency module 254 that estimates, or determines, a fundamental frequency, or pitch, of both the reference recording 214 and the test recording. In voice feature recognition algorithms, the term “pitch” may be used to describe the fundamental frequency of a voice sample. Also, the fundamental frequency may be defined as the rate of vibration of the vocal folds.

Estimation of the fundamental frequency of the test recording containing the test passphrase 210 and the reference recording 214 to determine a test fundamental frequency and a reference fundamental frequency, respectively, may be performed using any technique, such as the autocorrelation methods, including pitch detection estimation (pda), frequency auto-correlation estimation (fxac), autocorrelation coefficient function (acf), normalized autocorrelation coefficient function (nacf), additive estimation, or any other fundamental frequency correlation method.

In one embodiment, estimation of the fundamental frequency of the voice of the speaker 206 of the test passphrase 210 and the voice of the speaker of the reference passphrase 212 may be performed using all or part of the YIN fundamental frequency estimation method described in “YIN, a Fundamental Frequency Estimator for Speech and Music”, Journal of the Acoustical Society of America, April 2002, 1917-1930, Volume 111, Issue 4 by Alain de Cheveigné and Hideki Kawahara, which is herein incorporated by reference in its entirety. YIN includes several steps, including an initial step that includes an autocorrelation function, and subsequent steps that seek to reduce error rates. In implementing YIN, the fundamental frequency module 254 may determine the autocorrelation function of a discrete speech signal x_(t), such as a test or reference recording, using the following equation:

$r_{t}(\tau) = \sum_{j=t+1}^{t+W} x_{j}\, x_{j+\tau}$

wherein r_(t)(τ) is the autocorrelation function of lag τ calculated at time index t, and W is the integration window size. The autocorrelation method compares the signal to its shifted self. Also, the autocorrelation function is the Fourier transform of the power spectrum, and may be considered to measure the regular spacing of harmonics within that spectrum.

The next step in YIN involves a difference function, in which the fundamental frequency module 254 models the signal x_(t) as a periodic function with period T, by definition invariant for a time shift of T:

$x_{t} - x_{t+T} = 0, \quad \forall t$

The same is true after taking the square and averaging over a window:

$\sum_{j=t+1}^{t+W} \left(x_{j} - x_{j+T}\right)^{2} = 0$

Conversely, an unknown period may be found by forming the difference function:

$d_{t}(\tau) = \sum_{j=1}^{W} \left(x_{j} - x_{j+\tau}\right)^{2}$

and searching for the values of τ for which the function is zero. An infinite set of values for which the function is zero exists, and these values are all multiples of the period. The squared sum may be expanded, and the function may be expressed in terms of the autocorrelation function:

$d_{t}(\tau) = r_{t}(0) + r_{t+\tau}(0) - 2 r_{t}(\tau)$

The first two terms are energy terms. If these first two terms were constant, the difference function d_(t)(τ) would vary as the opposite of r_(t)(τ), and searching for a minimum of one or the maximum of the other would give the same result. The second energy term also varies with τ, implying that maxima of r_(t)(τ) and minima of d_(t)(τ) may sometimes not coincide. In one embodiment, the difference function d_(t)(τ) may replace the autocorrelation function to yield a lower error, and allow for the application of the subsequent steps in YIN.
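The relationship between the autocorrelation function and the difference function can be checked numerically with a short sketch. The function names are hypothetical, the signal is a zero-indexed list, and the difference function is summed over the same window j = t+1 … t+W as the autocorrelation so that the expansion above holds term by term.

```python
import math

def autocorr(x, t, tau, W):
    """r_t(tau): sum of x[j] * x[j + tau] for j = t+1 .. t+W."""
    return sum(x[j] * x[j + tau] for j in range(t + 1, t + W + 1))

def difference(x, t, tau, W):
    """d_t(tau): sum of squared differences over the same window."""
    return sum((x[j] - x[j + tau]) ** 2 for j in range(t + 1, t + W + 1))

# Illustrative signal: a sinusoid with a period of 8 samples.
signal = [math.sin(2.0 * math.pi * i / 8.0) for i in range(64)]
```

For this signal, `difference(signal, 0, 8, 16)` is zero up to rounding because the lag equals the period, whereas a lag of half a period gives a large value.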

In the third step of YIN, the fundamental frequency module 254 may replace the difference function by the “cumulative mean normalized difference function”:

${d_{t}^{\prime}(\tau)} = \left\{ {\begin{matrix}{1,} & {{{if}\mspace{14mu} \tau} = 0} \\{{d_{t}(\tau)}/\left\lbrack {\left( \frac{1}{\tau} \right){\sum\limits_{j = 1}^{\tau}{d_{t}(j)}}} \right\rbrack} & {otherwise}\end{matrix}.} \right.$

The cumulative mean normalized difference function is obtained by dividing each value of the old function by its average over shorter-lag values.
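A minimal sketch of this normalization is given below. The function name is hypothetical, the input is assumed to be a list whose entry at index τ holds d(τ), and the fallback value of 1.0 for an all-zero running sum (for example, digital silence) is an added assumption to avoid division by zero.

```python
def cumulative_mean_normalized(d):
    """Return d'(tau) for tau = 0 .. len(d) - 1, given d[tau] = d(tau)."""
    out = [1.0]          # d'(0) is defined as 1
    running = 0.0        # running sum of d(1) .. d(tau)
    for tau in range(1, len(d)):
        running += d[tau]
        out.append(d[tau] * tau / running if running > 0.0 else 1.0)
    return out
```

Because the function starts at 1 and stays near 1 at small lags, the spurious dip of the raw difference function at zero lag no longer competes with the dip at the true period.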

In the fourth step of YIN, the fundamental frequency module 254 may set an absolute threshold and choose the smallest value of τ that gives a minimum of d′ deeper than that threshold. If none is found, the global minimum is chosen instead. If the period is the smallest positive member of a set, the threshold determines the list of candidates admitted to the set, and may be considered to be the proportion of aperiodic power tolerated within a “periodic” signal. By way of illustration, consider the identity:

$2\left(x_{t}^{2} + x_{t+T}^{2}\right) = \left(x_{t} + x_{t+T}\right)^{2} + \left(x_{t} - x_{t+T}\right)^{2}$

Taking the average over a window and dividing by 4,

$\frac{1}{2W} \sum_{j=t+1}^{t+W} \left(x_{j}^{2} + x_{j+T}^{2}\right) = \frac{1}{4W} \sum_{j=t+1}^{t+W} \left(x_{j} + x_{j+T}\right)^{2} + \frac{1}{4W} \sum_{j=t+1}^{t+W} \left(x_{j} - x_{j+T}\right)^{2}$

The power of the signal is approximated by the left-hand side. The two terms on the right-hand side constitute a partition of this power. If the signal is periodic with period T, the second of the two terms on the right-hand side is zero, and is unaffected by adding or subtracting periodic components at that period. The second of the two terms on the right-hand side may be interpreted as the “aperiodic power” component of the signal power. When τ=T, the numerator of the cumulative mean normalized difference function described above is proportional to aperiodic power whereas its denominator, the average of d(τ) for τ between 0 and T, is approximately twice the signal power. Therefore, d′(T) is proportional to the aperiodic/total power ratio. If this ratio is below the threshold, a candidate T is accepted into the set. Error rates may not be critically affected by the exact value of this threshold.
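The threshold scheme of the fourth step can be sketched as follows. The function name and the default threshold of 0.1 are assumptions for illustration; the input is assumed to be a list of d′ values indexed by lag, and a "minimum" is taken to be a value no larger than both of its neighbors.

```python
def pick_period(d_prime, threshold=0.1):
    """Return the smallest lag with a local minimum of d' below threshold.

    Falls back to the lag of the global minimum (excluding lag 0) when no
    local minimum is deep enough.
    """
    for tau in range(1, len(d_prime) - 1):
        is_local_min = (d_prime[tau] <= d_prime[tau - 1]
                        and d_prime[tau] <= d_prime[tau + 1])
        if is_local_min and d_prime[tau] < threshold:
            return tau
    return min(range(1, len(d_prime)), key=lambda tau: d_prime[tau])
```

Choosing the smallest qualifying lag, rather than the deepest dip, is what keeps the estimate from locking onto a multiple of the true period.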

In the fifth step of YIN, the fundamental frequency module 254 may employ parabolic interpolation. In particular, the fundamental frequency module 254 may fit each local minimum of d′(τ) and its immediate neighbors by a parabola. The fundamental frequency module 254 may use the ordinate of the interpolated minimum in the dip-selection process. The abscissa of the selected minimum may then serve as a period estimate. An estimate obtained in this way may be slightly biased. To avoid this bias, the abscissa of the corresponding minimum of the raw difference function d(τ) is used instead.
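The parabola fit through a local minimum and its two neighbors has a closed form, sketched below. The function name is hypothetical; the three samples are assumed to sit at lags −1, 0, and +1 relative to the local minimum, and the returned offset is added to the integer lag to obtain the refined abscissa.

```python
def parabolic_minimum(y_left, y_mid, y_right):
    """Return (offset, value) of the vertex of the parabola through
    (-1, y_left), (0, y_mid), (1, y_right)."""
    denom = y_left - 2.0 * y_mid + y_right
    if denom == 0.0:
        # The three points are collinear; keep the sampled minimum.
        return 0.0, y_mid
    offset = 0.5 * (y_left - y_right) / denom       # abscissa of the vertex
    value = y_mid - 0.25 * (y_left - y_right) * offset  # ordinate of the vertex
    return offset, value
```

The offset always lies between −0.5 and +0.5 when the middle sample is a true local minimum, so the refinement never moves the estimate past a neighboring sample.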

For non-stationary speech intervals, it may be found that the estimate fails at a certain phase of the period that usually coincides with a relatively high value of d′(T_(t)), wherein T_(t) is the period estimate at time t. At another phase (time t′), the estimate may be correct and the value of d′(T_(t′)) smaller. In the sixth step of YIN, the fundamental frequency module 254 takes advantage of this fact by searching around the vicinity of each analysis point for a better estimate. In particular, for each time index t, the fundamental frequency module 254 may search for a minimum of d′_(θ)(T_(θ)) for θ within a small interval [t−T_(max)/2, t+T_(max)/2], wherein T_(θ) is the estimate at time θ and T_(max) is the largest expected period. Based on this initial estimate, the fundamental frequency module 254 may apply the estimation algorithm again with a restricted search range to obtain the final estimate. By way of non-limiting example, using T_(max)=25 milliseconds and a final search range of ±20% of the initial estimate, step six of YIN may reduce the error rate to 0.5% (from 0.77%). While step six of YIN may be considered to be associated with median smoothing or dynamic programming techniques, it differs in that it takes into account a relatively short interval and bases its choice on quality rather than mere continuity.

Referring to the steps of YIN as a whole, replacing the autocorrelation function (step 1) by the difference function (step 2) opens the way for the cumulative mean normalization operation (step 3), upon which are based the threshold scheme (step 4) and the measure of d′(T) that selects the best local estimate (step 6). While parabolic interpolation (step 5) may be considered independent from the other steps, it does rely on the spectral properties of the autocorrelation function (step 1). The fundamental frequency module 254 may utilize any combination of these steps of YIN, and in any order.

The voice feature recognition module 252 includes a voice feature comparison engine 256 that compares a voice feature of the speaker 206 of the test passphrase 210 to a voice feature of the speaker of the reference passphrase 212. For example, the voice feature comparison engine 256 may compare the fundamental frequency or pitch of the test speech input 204 (the test fundamental frequency) with the fundamental frequency or pitch of the reference speech input 238 (the reference fundamental frequency) to determine whether the speaker 206 of the test speech input 204 matches the speaker of the reference speech input 238.

Whether a match is found between the test fundamental frequency and the reference fundamental frequency may depend on the level of similarity between these fundamental frequencies that is required before the voice feature comparison engine 256 determines that a match has been found. A voice feature match scoring module 258 may be included in the voice feature comparison engine 256 to give the user some control over the stringency with which the test fundamental frequency and the reference fundamental frequency are compared. In similar manner, the voice feature match scoring module 258 may be used to adjust the stringency with which any other voice feature of the speaker 206 and the speaker of the reference passphrase 212 is compared to determine a match. For example, the voice feature match scoring module 258 may allow a user to “loosen” or “tighten” the stringency with which the reference fundamental frequency is compared to the test fundamental frequency, such that when the comparison standard is loosened, reference and test fundamental frequencies that are relatively dissimilar will be determined to be a match when a match would not have been declared under a more tightened standard. As with the passphrase match scoring module 250 described above, the ability to adjust the stringency with which to declare a match between voice features of the speaker 206 and the speaker of the reference passphrase 212 provides versatility to the speech-based speaker authenticator 202, and allows the speech-based speaker authenticator 202 to be used in a wide variety of environments.

In one embodiment, the voice feature match scoring module 258 may determine a score based on the similarity between the test fundamental frequency and the reference fundamental frequency. The score indicates the similarity between the test fundamental frequency and the reference fundamental frequency. The voice feature match scoring module 258 may then use the score to determine whether the speaker 206 is the same as a speaker of the reference passphrase 212.

For example, the fundamental frequency module 254 may estimate the fundamental frequency, or pitch, for each voiced frame of the reference and test passphrases 212, 210 to form a set of reference fundamental frequency values and a set of test fundamental frequency values, respectively. In one non-limiting example, each voiced frame is 25 milliseconds, although other frame times may be used. Also, the determination of voiced, versus silent, frames in the reference and test passphrases 212, 210 as discussed above may be used.

The two sets of estimated fundamental frequency values yielded by the fundamental frequency module 254 may be compared to determine a matching score. A preconfigured number of local peak fundamental frequency values may be identified and excluded from comparison to avoid the possibility of octave errors that may be inherently present as a result of YIN processing, thus forming a modified set of reference fundamental frequency values and a modified set of test fundamental frequency values. Further, the voice feature comparison engine 256 may determine a resulting distance measure between the original or modified reference and test passphrase fundamental frequency value sets using either Euclidean or Itakura distance metrics, the resulting distance measure representing a matching score between the test and reference passphrases 210, 212. Further, the voice feature comparison engine 256 may use a set of one or more user-definable preconfigured matching thresholds to estimate a “successful” or “failed” match between the speaker 206 and the speaker of the reference passphrase 212. Whether the test fundamental frequency matches the reference fundamental frequency, and as a result, whether the speaker 206 matches the speaker of the reference passphrase 212, is based on the comparison between the score and the match threshold. For example, if the resulting distance measure, or score, exceeds a preconfigured matching threshold, then a mismatch may be declared by the voice feature comparison engine 256.
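One way the peak exclusion and the Euclidean comparison might fit together is sketched below. Several choices here are assumptions rather than details from the description: the names are hypothetical, "local peaks" are simplified to the largest values in each set, the sets are truncated to a common length, and the Euclidean distance is divided by the number of compared values (a root-mean-square form) so that the threshold does not depend on passphrase length.

```python
import math

def drop_peaks(f0_values, num_peaks):
    """Remove the num_peaks largest values, keeping the original order."""
    if num_peaks <= 0:
        return list(f0_values)
    order = sorted(range(len(f0_values)), key=lambda i: f0_values[i], reverse=True)
    dropped = set(order[:num_peaks])
    return [v for i, v in enumerate(f0_values) if i not in dropped]

def f0_distance(ref, test):
    """Root-mean-square Euclidean distance over the common length."""
    n = min(len(ref), len(test))
    if n == 0:
        raise ValueError("need at least one value in each set")
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(ref[:n], test[:n])) / n)

def speakers_match(ref, test, max_distance, num_peaks=1):
    """Declare a mismatch when the distance exceeds max_distance."""
    return f0_distance(drop_peaks(ref, num_peaks),
                       drop_peaks(test, num_peaks)) <= max_distance
```

In this sketch a value near twice its neighbors, typical of an octave error, is discarded before the comparison, so a single doubled estimate does not dominate the distance.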

In another illustrative embodiment, the similarity between the reference fundamental frequency and the test fundamental frequency may be given a score between 0 and 100, where 0 indicates complete dissimilarity and 100 indicates an exact match between the reference fundamental frequency and the test fundamental frequency. In this example, the user may define a match threshold anywhere from 0 to 100. If the user selects a match threshold of 40, for example, a match between the reference fundamental frequency and the test fundamental frequency will be determined if the score meets or exceeds the match threshold of 40. If the user selects a match threshold of 90, more stringent match criteria will apply, and a match between the reference fundamental frequency and the test fundamental frequency will be found only if the score meets or exceeds 90.

The fundamental frequency module 254 may utilize YIN, or any other fundamental frequency or pitch estimation method, to determine the fundamental frequency or pitch of a test recording that includes the test passphrase 210 and the reference recording 214. Other voice features of the speaker 206 and the speaker of the reference passphrase 212 may also be measured and used for comparison purposes.

The voice feature match scoring module 258 may employ two or more reference fundamental frequencies that are converted from two or more respective reference speech inputs 238 containing the same reference passphrase 212. The voice feature match scoring module 258 may compare the test fundamental frequency to the multiple reference fundamental frequencies stored by the recording storage 240. In particular, the voice feature match scoring module 258 may determine a score that corresponds to one of the following scenarios: (1) the test fundamental frequency matches, within a predetermined tolerance, the multiple reference fundamental frequencies, (2) the test fundamental frequency matches, within a predetermined tolerance, any one of the multiple reference fundamental frequencies, or (3) the test fundamental frequency matches, within a predetermined tolerance, any one of the multiple reference fundamental frequencies in addition to an external boundary condition (e.g., a noisy environment or a reference or test speaker known to be speech-impaired). A match may be declared between the speaker 206 and the speaker of the reference passphrase 212 for any one of these scenarios depending on the desired stringency with which to compare these speakers.

Although the passphrase comparison engine 248 and the voice feature comparison engine 256 are shown to be separate elements included in each of the passphrase recognition module 242 and the voice feature recognition module 252, respectively, the passphrase comparison engine 248 may be combined into a single module with the voice feature comparison engine 256, and this combined module may be separate or a part of any element of the speech-based speaker authenticator 202.

In one embodiment, if both the passphrase recognition module 242 and the voice feature recognition module 252 determine that a match has been found, the speaker 206 will be authenticated as being the same person that spoke the reference passphrase 212. In particular, if the passphrase recognition module 242 determines that the test passphrase 210 spoken by the speaker 206 matches the reference passphrase 212, and the voice feature recognition module 252 determines that the speaker 206 is the same, or matching, person that spoke the reference passphrase 212 based on a voice feature analysis, then the speech-based speaker authenticator 202 authenticates the speaker 206. In another embodiment, the speech-based speaker authenticator 202 may authenticate the speaker 206 if a match is found by only one of the passphrase recognition module 242 or the voice feature recognition module 252. As indicated above, whether a match is found by either of these modules may be customized by a user to allow for varying levels of comparison stringency, such as by use of the passphrase match scoring module 250 or the voice feature match scoring module 258. For example, the match threshold for each scoring module may differ to customize the weight given to each of the passphrase recognition module 242 and the voice feature recognition module 252.

Whether or not the speaker 206 is authenticated may be included as data in the speaker authentication data 228, which may be sent to the access-protected entity 230 for further processing. In another embodiment, the speech-based speaker authenticator 202 may itself provide access to any product, service, entity, etc. based on whether the speaker 206 is authenticated.

Referring to FIG. 4, an illustrative embodiment of a process for authenticating a speaker that is executable by a speech-based speaker authenticator, such as the speech-based speaker authenticator 102 or 202 in FIG. 1 or 3, respectively, includes receiving reference speech input that includes a reference passphrase (step 301). The process receives test speech input that includes a test passphrase (step 303). The test speech input may be received at any time after the reference speech input is received.

The process determines whether the reference passphrase matches the test passphrase (step 307). If the process determines that the reference passphrase does not match the test passphrase, the process determines that the speaker of the test speech input is not authenticated (step 313). The process then determines whether to provide another opportunity to authenticate a speaker, such as the last speaker to have spoken the test passphrase (step 315). If the process determines to provide another opportunity to authenticate the speaker, the process returns to step 303. If the process determines not to provide another opportunity to authenticate the speaker, the process then terminates.

Returning to step 307, if the process determines that the reference passphrase matches the test passphrase, the process determines whether the voice features of the speaker of the reference speech input match the voice features of the speaker of the test speech input (step 309). If the process determines that the voice features of the speaker of the reference speech input match the voice features of the speaker of the test speech input, the process determines that the speaker of the test speech input is authenticated (step 311). Returning to step 309, if the process determines that the voice features of the speaker of the reference speech input do not match the voice features of the speaker of the test speech input, the process proceeds to step 313, in which the speaker of the test speech input is not authenticated.
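The control flow of FIG. 4 can be outlined in a few lines. This is only a sketch: the two comparison callables and the `max_attempts` limit are hypothetical stand-ins, since the description does not specify how step 315 decides whether to offer another attempt.

```python
from typing import Callable, Iterable


def run_authentication(
    test_inputs: Iterable[object],
    passphrase_matches: Callable[[object], bool],
    voice_features_match: Callable[[object], bool],
    max_attempts: int = 3,
) -> bool:
    """Return True once a test input passes both checks, else False."""
    for attempt, test_input in enumerate(test_inputs, start=1):
        if attempt > max_attempts:
            break  # step 315: no further opportunity is provided
        # Step 307: compare the passphrases first.
        if not passphrase_matches(test_input):
            continue  # step 313, then back to step 303 if attempts remain
        # Step 309: only then compare the voice features.
        if voice_features_match(test_input):
            return True  # step 311: authenticated
    return False  # step 313: not authenticated
```

Checking the passphrase before the voice features mirrors the order of steps 307 and 309, so the voice comparison is skipped whenever the passphrase already fails.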

Referring to FIG. 5, an illustrative embodiment of a process for authenticating a speaker that is executable by a speech-based speaker authenticator, such as the speech-based speaker authenticator 102 or 202 in FIG. 1 or 3, respectively, includes receiving reference speech input that includes a reference passphrase to form a reference recording (step 401). The process determines a reference set of feature vectors for the reference recording (step 403). The process receives test speech input that includes a test passphrase to form a test recording (step 405). The process determines a test set of feature vectors for the test recording (step 407). The process correlates the reference set of feature vectors with the test set of feature vectors over time, such as by using dynamic time warping, derivative dynamic time warping, or another dynamic time warping method (step 409). The process compares the reference set of feature vectors with the test set of feature vectors (step 411).
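To illustrate the correlation of step 409, the following is a minimal sketch of classic dynamic time warping over two sequences of feature vectors. It is not the derivative variant, it uses a Euclidean local distance as an assumption, and it returns the unnormalized cumulative distance; the function names are illustrative only.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Local distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(
    reference: Sequence[Sequence[float]],
    test: Sequence[Sequence[float]],
) -> float:
    """Cumulative distance of the best time alignment of the two sequences."""
    n, m = len(reference), len(test)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    inf = float("inf")
    # cost[i][j] = best cumulative distance aligning the first i reference
    # frames with the first j test frames.
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(reference[i - 1], test[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # reference frame advances
                cost[i][j - 1],      # test frame advances
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because the alignment may stretch or compress time, a passphrase spoken more slowly than the reference can still yield a small cumulative distance.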

The process determines whether the reference passphrase matches the test passphrase based on the feature vector comparison (step 413). If the process determines that the reference passphrase does not match the test passphrase, the process determines that the speaker of the test speech input is not authenticated (step 415). Returning to step 413, if the process determines that the reference passphrase matches the test passphrase, the process determines a reference fundamental frequency of the reference recording (step 417). The process determines a test fundamental frequency of the test recording (step 419). The process then compares the reference fundamental frequency to the test fundamental frequency (step 421).

The process determines whether the speaker of the test speech input matches the speaker of the reference speech input (step 423). If the process determines that the speaker of the test speech input does not match the speaker of the reference speech input, the process determines that the speaker of the test speech input is not authenticated. Returning to step 423, if the process determines that the speaker of the test speech input matches the speaker of the reference speech input, the process authenticates the speaker of the test speech input (step 425).

Referring to FIG. 6, an illustrative embodiment of a process that utilizes a length ratio to compare a test passphrase to a reference passphrase is shown. The process is executable by the passphrase recognition module 242 in FIG. 3, and may be performed prior to determining the test and reference sets of feature vectors as described in steps 403 and 407 of FIG. 5. The process includes classifying each frame in the reference set of feature vectors and the test set of feature vectors as one of a voiced frame or a silent frame to form a voiced reference set of feature vectors and a voiced test set of feature vectors (step 501). The process includes comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine a length ratio (step 503). The process also includes determining whether the test passphrase is different from the reference passphrase based on the length ratio (step 505).
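Steps 501 through 505 might look like the following sketch. Several details are assumptions made for illustration: frames are classified by comparing a per-frame energy to a threshold, the length ratio is taken as the longer voiced frame count divided by the shorter, and the passphrases are treated as different when that ratio exceeds a configurable limit.

```python
from typing import List, Sequence


def voiced_frames(
    frames: Sequence[Sequence[float]],
    energies: Sequence[float],
    energy_threshold: float,
) -> List[Sequence[float]]:
    """Step 501: keep frames whose energy exceeds the threshold."""
    return [f for f, e in zip(frames, energies) if e > energy_threshold]


def length_ratio(voiced_ref: Sequence, voiced_test: Sequence) -> float:
    """Step 503: ratio of the longer voiced set to the shorter (always >= 1)."""
    shorter, longer = sorted((len(voiced_ref), len(voiced_test)))
    if shorter == 0:
        return float("inf")  # one recording has no voiced frames at all
    return longer / shorter


def passphrases_differ(
    voiced_ref: Sequence, voiced_test: Sequence, max_ratio: float
) -> bool:
    """Step 505: a large length mismatch suggests different passphrases."""
    return length_ratio(voiced_ref, voiced_test) > max_ratio
```

A check of this kind is inexpensive, so it can reject an obviously mismatched passphrase before any time alignment is attempted.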

Referring to FIG. 7, an illustrative embodiment of a process that determines, modifies, and compares reference and test fundamental frequency values is shown. The process is executable by the voice feature recognition module 252 in FIG. 3, and provides a non-limiting example of the details of steps 417 through 423 of FIG. 5. The process includes determining a set of reference fundamental frequency values for a reference recording (step 551), and determining a set of test fundamental frequency values for a test recording (step 553). The process includes identifying a set of local peak fundamental frequency values in the set of reference fundamental frequency values and the set of test fundamental frequency values (step 555). The process also includes excluding the set of local peak fundamental frequency values from the set of reference fundamental frequency values and the set of test fundamental frequency values to form a modified set of reference fundamental frequency values and a modified set of test fundamental frequency values (step 557). The process includes determining a resulting distance measure between the modified set of reference fundamental frequency values and the modified set of test fundamental frequency values to form a matching score (step 559). The process also includes comparing the matching score to a preconfigured matching threshold to determine whether the speaker of the test speech input matches the speaker of the reference speech input (step 561).
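Steps 555 through 561 can be illustrated with the sketch below. The definition of a local peak (a value strictly greater than both neighbors) and the distance measure (absolute difference between the means of the modified sets) are assumptions chosen for simplicity; this passage does not fix either, and other measures could be substituted.

```python
from statistics import mean
from typing import List, Sequence


def exclude_local_peaks(values: Sequence[float]) -> List[float]:
    """Steps 555 and 557: drop values that exceed both of their neighbors."""
    kept = []
    for i, v in enumerate(values):
        is_peak = (
            0 < i < len(values) - 1
            and v > values[i - 1]
            and v > values[i + 1]
        )
        if not is_peak:
            kept.append(v)
    return kept


def matching_score(ref_f0: Sequence[float], test_f0: Sequence[float]) -> float:
    """Step 559: distance between the modified sets (assumed: mean difference)."""
    ref_mod = exclude_local_peaks(ref_f0)
    test_mod = exclude_local_peaks(test_f0)
    if not ref_mod or not test_mod:
        raise ValueError("no fundamental frequency values remain to compare")
    return abs(mean(ref_mod) - mean(test_mod))


def speakers_match(
    ref_f0: Sequence[float], test_f0: Sequence[float], threshold_hz: float
) -> bool:
    """Step 561: a score at or below the threshold counts as a match."""
    return matching_score(ref_f0, test_f0) <= threshold_hz
```

Removing local peaks before the comparison keeps isolated pitch spikes from dominating the score.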

The flowcharts and block diagrams in the different depicted embodiments illustrate the architecture, functionality, and operation of some possible implementations of apparatus, methods and computer program products. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified function or functions. In some alternative implementations, the function or functions noted in the block may occur out of the order noted in the Figures. For example, in some cases, two blocks shown in succession may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Referring to FIG. 8, a block diagram of a computing device 602 is shown in which illustrative embodiments may be implemented. The computing device 602 may implement the speech-based speaker authenticator 102 or 202 in FIG. 1 or 3, respectively. Computer-usable program code or instructions implementing the processes used in the illustrative embodiments may be located on the computing device 602. The computing device 602 includes a communications fabric 603, which provides communications between a processor unit 605, a memory 607, a persistent storage 609, a communications unit 611, an input/output (I/O) unit 613, and a display 615.

The processor unit 605 serves to execute instructions for software that may be loaded into the memory 607. The processor unit 605 may be a set of one or more processors or may be a multi-processor core, depending on the particular implementation. Further, the processor unit 605 may be implemented using one or more heterogeneous processor systems in which a main processor is present with secondary processors on a single chip. As another illustrative example, the processor unit 605 may be a symmetric multi-processor system containing multiple processors of the same type.

The memory 607, in these examples, may be, for example, a random access memory or any other suitable volatile or non-volatile storage device. The persistent storage 609 may take various forms depending on the particular implementation. For example, the persistent storage 609 may contain one or more components or devices. For example, the persistent storage 609 may be a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above. The media used by the persistent storage 609 also may be removable. For example, a removable hard drive may be used for the persistent storage 609. In one embodiment, the recording storage 240 in FIG. 3 may be implemented on the memory 607 or the persistent storage 609.

The communications unit 611, in these examples, provides for communications with other data processing systems or communication devices. In these examples, the communications unit 611 may be a network interface card. The communications unit 611 may provide communications through the use of either or both physical and wireless communication links.

The input/output unit 613 allows for the input and output of data with other devices that may be connected to the computing device 602. For example, the input/output unit 613 may provide a connection for user input through a keyboard and mouse. Further, the input/output unit 613 may send output to a processing device. In the case in which the computing device 602 is a cellular phone, the input/output unit 613 may also allow devices to be connected to the cellular phone, such as microphones, headsets, and controllers. The display 615 provides a mechanism to display information to a user, such as a graphical user interface.

Instructions for the operating system and applications or programs are located on the persistent storage 609. These instructions may be loaded into the memory 607 for execution by the processor unit 605. The processes of the different embodiments may be performed by the processor unit 605 using computer-implemented instructions, which may be located in a memory, such as the memory 607. These instructions are referred to as program code, computer-usable program code, or computer-readable program code that may be read and executed by a processor in the processor unit 605. The program code in the different embodiments may be embodied on different physical or tangible computer-readable media, such as the memory 607 or the persistent storage 609.

Program code 617 is located in a functional form on a computer-readable media 619 and may be loaded onto or transferred to the computing device 602 for execution by the processor unit 605. The program code 617 and the computer-readable media 619 form computer program product 621 in these examples. In one embodiment, the computer program product 621 is the speech-based speaker authenticator 102 or 202 in FIG. 1 or 3, respectively. In this embodiment, the computing device 602 may be the server 116 in FIG. 1, and the program code 617 may include computer-usable program code capable of receiving reference speech input comprising a reference passphrase to form a reference recording, and determining a reference set of feature vectors for the reference recording. The reference set of feature vectors may have a time dimension. The program code 617 may also include computer-usable program code capable of receiving test speech input comprising a test passphrase to form a test recording, and determining a test set of feature vectors for the test recording. The test set of feature vectors may have the time dimension. The program code 617 may also include computer-usable program code capable of correlating the reference set of feature vectors with the test set of feature vectors over the time dimension, and comparing the reference set of feature vectors to the test set of feature vectors to determine whether the test passphrase matches the reference passphrase in response to correlating the reference set of feature vectors with the test set of feature vectors over the time dimension. The program code 617 may also include computer-usable program code capable of determining a reference fundamental frequency of the reference recording, determining a test fundamental frequency of the test recording, and comparing the reference fundamental frequency to the test fundamental frequency to determine whether a speaker of the test speech input matches a speaker of the reference speech input. The program code 617 may also include computer-usable program code capable of authenticating the speaker of the test speech input in response to determining that the reference passphrase matches the test passphrase and that the speaker of the test speech input matches the speaker of the reference speech input.

In another embodiment, the program code 617 may include computer-usable program code capable of receiving reference speech input including a reference passphrase to form a reference recording and determining a reference set of feature vectors for the reference recording. The reference set of feature vectors has a time dimension. The program code 617 may also include computer-usable program code capable of receiving test speech input including a test passphrase to form a test recording and determining a test set of feature vectors for the test recording. The test set of feature vectors has the time dimension. The program code 617 may also include computer-usable program code capable of classifying each frame in the reference set of feature vectors and the test set of feature vectors as one of a voiced frame or a silent frame to form a voiced reference set of feature vectors and a voiced test set of feature vectors, comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine a length ratio, and determining whether the test passphrase is different from the reference passphrase based on the length ratio. The program code 617 may also include computer-usable program code capable of correlating the voiced reference set of feature vectors with the voiced test set of feature vectors over the time dimension and comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine whether the test passphrase matches the reference passphrase in response to correlating the voiced reference set of feature vectors with the voiced test set of feature vectors over the time dimension. The program code 617 may also include computer-usable program code capable of determining a set of reference fundamental frequency values for the reference recording, determining a set of test fundamental frequency values for the test recording, identifying a set of local peak fundamental frequency values in the set of reference fundamental frequency values and the set of test fundamental frequency values, excluding the set of local peak fundamental frequency values from the set of reference fundamental frequency values and the set of test fundamental frequency values to form a modified set of reference fundamental frequency values and a modified set of test fundamental frequency values, comparing the modified set of reference fundamental frequency values to the modified set of test fundamental frequency values to determine whether a speaker of the test speech input matches a speaker of the reference speech input, and authenticating the speaker of the test speech input in response to determining that the reference passphrase matches the test passphrase and that the speaker of the test speech input matches the speaker of the reference speech input. Any combination of the above-mentioned computer-usable program code may be implemented in the program code 617, and any functions of the illustrative embodiments may be implemented in the program code 617.

In one example, the computer-readable media 619 may be in a tangible form, such as, for example, an optical or magnetic disc that is inserted or placed into a drive or other device that is part of the persistent storage 609 for transfer onto a storage device, such as a hard drive that is part of the persistent storage 609. In a tangible form, the computer-readable media 619 also may take the form of a persistent storage, such as a hard drive or a flash memory that is connected to the computing device 602. The tangible form of the computer-readable media 619 is also referred to as computer recordable storage media.

Alternatively, the program code 617 may be transferred to the computing device 602 from the computer-readable media 619 through a communication link to the communications unit 611 or through a connection to the input/output unit 613. The communication link or the connection may be physical or wireless in the illustrative examples. The computer-readable media 619 also may take the form of non-tangible media, such as communication links or wireless transmissions containing the program code 617.

The different components illustrated for the computing device 602 are not meant to provide architectural limitations to the manner in which different embodiments may be implemented. The different illustrative embodiments may be implemented in a data processing system including components in addition to or in place of those illustrated for computing device 602. Other components shown in FIG. 8 can be varied from the illustrative examples shown.

As one example, a storage device in the computing device 602 is any hardware apparatus that may store data. The memory 607, the persistent storage 609, and the computer-readable media 619 are examples of storage devices in a tangible form.

In another example, a bus system may be used to implement the communications fabric 603 and may be comprised of one or more buses, such as a system bus or an input/output bus. Of course, the bus system may be implemented using any suitable type of architecture that provides for a transfer of data between different components or devices attached to the bus system. Additionally, the communications unit 611 may include one or more devices used to transmit and receive data, such as a modem or a network adapter. Further, a memory may be, for example, the memory 607 or a cache such as found in an interface and memory controller hub that may be present in the communications fabric 603.

The principles of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. In one embodiment, the invention is implemented in software, which includes but is not limited to, firmware, resident software, microcode, and other computer readable code.

Furthermore, the principles of the present invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The previous detailed description is of a small number of embodiments for implementing the invention and is not intended to be limiting in scope. One of skill in this art will immediately envisage the methods and variations used to implement this invention in other areas than those described in detail. The following claims set forth a number of the embodiments of the invention disclosed with greater particularity.

What is claimed is:
 1. A method for authenticating a speaker, the method comprising: receiving reference speech input comprising a reference passphrase to form a reference recording; determining a reference set of feature vectors for the reference recording, the reference set of feature vectors having a time dimension; receiving test speech input comprising a test passphrase to form a test recording; determining a test set of feature vectors for the test recording, the test set of feature vectors having the time dimension; correlating the reference set of feature vectors with the test set of feature vectors over the time dimension; comparing the reference set of feature vectors to the test set of feature vectors to determine whether the test passphrase matches the reference passphrase; determining a reference fundamental frequency of the reference recording; determining a test fundamental frequency of the test recording; comparing the reference fundamental frequency to the test fundamental frequency to determine whether a speaker of the test speech input matches a speaker of the reference speech input; and authenticating the speaker of the test speech input in response to determining that the reference passphrase matches the test passphrase and that the speaker of the test speech input matches the speaker of the reference speech input.
 2. The method of claim 1, wherein the reference recording and the test recording are digital recordings having an original sampling rate, further comprising: determining the reference set of feature vectors for the reference recording after converting the reference recording from the original sampling rate to a conversion sampling rate; and determining the test set of feature vectors for the test recording after converting the test recording from the original sampling rate to the conversion sampling rate.
 3. The method of claim 1, wherein the reference set of feature vectors comprises 13-dimensional Mel Cepstrum feature vectors, and wherein the test set of feature vectors comprises 13-dimensional Mel Cepstrum feature vectors.
 4. The method of claim 1, wherein correlating the reference set of feature vectors with the test set of feature vectors over the time dimension is performed using dynamic time warping.
 5. The method of claim 1, wherein correlating the reference set of feature vectors with the test set of feature vectors over the time dimension is performed using a derivative dynamic time warping process, the derivative dynamic time warping process outputting a minimal cumulative distance DTW(Q,C) normalized by K to form a value; wherein $DTW(Q,C) = \min\left\{ \frac{\sqrt{\sum_{k=1}^{K} w_{k}}}{K} \right\}$; wherein K is a number of elements in a warping path W; wherein W=w₁, w₂, . . . , w_(k), . . . , w_(K); the method further comprising defining a scoring weight that applies to DTW(Q,C) based on a cumulative length of the reference passphrase and the test passphrase, the scoring weight determining a threshold used to determine whether the test passphrase matches the reference passphrase.
 6. The method of claim 1, wherein the reference set of feature vectors and the test set of feature vectors each comprise a plurality of frames, further comprising: classifying each frame in the reference set of feature vectors and the test set of feature vectors as one of a voiced frame or a silent frame to form a voiced reference set of feature vectors and a voiced test set of feature vectors; comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine a length ratio; and determining whether the test passphrase is different from the reference passphrase based on the length ratio.
 7. The method of claim 1, wherein determining the reference fundamental frequency of the reference recording comprises determining a set of reference fundamental frequency values for the reference recording, each of the set of reference fundamental frequency values corresponding to a respective voiced frame in the reference recording; wherein determining the test fundamental frequency of the test recording comprises determining a set of test fundamental frequency values for the test recording, each of the set of test fundamental frequency values corresponding to a respective voiced frame in the test recording; and wherein comparing the reference fundamental frequency to the test fundamental frequency comprises determining a resulting distance measure between the set of reference fundamental frequency values and the set of test fundamental frequency values to form a matching score, and comparing the matching score to a preconfigured matching threshold to determine whether the speaker of the test speech input matches the speaker of the reference speech input.
 8. The method of claim 1, further comprising: reducing noise in the reference recording and the test recording prior to determining the reference set of feature vectors for the reference recording and determining the test set of feature vectors for the test recording.
 9. A speech-based speaker recognition system comprising: a passphrase recognition module to determine whether a test passphrase spoken as test speech input matches a reference passphrase spoken as reference speech input; a voice feature recognition module to determine whether a pitch of a speaker of the test passphrase matches a pitch of a speaker of the reference passphrase; and a recording storage to store a reference speech recording accessible by the passphrase recognition module and the voice feature recognition module, the reference speech recording comprising the reference passphrase.
 10. The speech-based speaker recognition system of claim 9, wherein the passphrase recognition module comprises a passphrase comparison engine to compare the test passphrase to the reference passphrase to determine whether the test passphrase matches the reference passphrase.
 11. The speech-based speaker recognition system of claim 10, wherein the passphrase comparison engine comprises a passphrase match scoring module to determine a score based on similarity between the test passphrase and the reference passphrase, and wherein the passphrase comparison engine determines whether the test passphrase matches the reference passphrase based on the score.
 12. The speech-based speaker recognition system of claim 9, wherein the passphrase recognition module comprises: a feature vector module for determining a test set of feature vectors for the test passphrase and for determining a reference set of feature vectors for the reference passphrase; and a dynamic time warping module to correlate the reference set of feature vectors with the test set of feature vectors over a time dimension.
 13. The speech-based speaker recognition system of claim 9, wherein the voice feature recognition module comprises a voice feature comparison engine to compare the pitch of the speaker of the test passphrase with the pitch of the speaker of the reference passphrase to determine whether the speaker of the test passphrase matches the speaker of the reference passphrase.
 14. The speech-based speaker recognition system of claim 13, wherein the voice feature comparison engine comprises a voice feature match scoring module to determine a matching score based on similarity between the pitch of the speaker of the test passphrase with the pitch of the speaker of the reference passphrase, and wherein the voice feature comparison engine determines whether the speaker of the test passphrase matches the speaker of the reference passphrase based on the matching score.
 15. The speech-based speaker recognition system of claim 9, the voice feature recognition module comprising a fundamental frequency module to determine the pitch of the speaker of the test passphrase and to determine the pitch of the speaker of the reference passphrase.
 16. A method for authenticating a speaker, the method comprising: receiving reference speech input comprising a reference passphrase to form a reference recording; determining a reference set of feature vectors for the reference recording, the reference set of feature vectors having a time dimension and comprising a plurality of frames; receiving test speech input comprising a test passphrase to form a test recording; determining a test set of feature vectors for the test recording, the test set of feature vectors having the time dimension and comprising a plurality of frames; classifying each frame in the reference set of feature vectors and the test set of feature vectors as one of a voiced frame or a silent frame to form a voiced reference set of feature vectors and a voiced test set of feature vectors; comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine a length ratio; determining whether the test passphrase is different from the reference passphrase based on the length ratio; correlating the voiced reference set of feature vectors with the voiced test set of feature vectors over the time dimension; comparing the voiced reference set of feature vectors to the voiced test set of feature vectors to determine whether the test passphrase matches the reference passphrase; determining a set of reference fundamental frequency values for the reference recording; determining a set of test fundamental frequency values for the test recording; identifying a set of local peak fundamental frequency values in the set of reference fundamental frequency values and the set of test fundamental frequency values; excluding the set of local peak fundamental frequency values from the set of reference fundamental frequency values and the set of test fundamental frequency values to form a modified set of reference fundamental frequency values and a modified set of test fundamental frequency values; comparing the modified set of reference fundamental frequency values to the modified set of test fundamental frequency values to determine whether a speaker of the test speech input matches a speaker of the reference speech input; and authenticating the speaker of the test speech input in response to determining that the reference passphrase matches the test passphrase and that the speaker of the test speech input matches the speaker of the reference speech input.
 17. The method of claim 16, wherein classifying each frame in the reference set of feature vectors and the test set of feature vectors as one of the voiced frame or the silent frame comprises classifying a given frame in the reference set of feature vectors and the test set of feature vectors as the voiced frame when an energy level of the given frame exceeds an energy threshold.
 18. The method of claim 16, wherein correlating the reference set of feature vectors with the test set of feature vectors over the time dimension is performed using a derivative dynamic time warping process, wherein the derivative dynamic time warping process is applied to the voiced reference set of feature vectors and the voiced test set of feature vectors.
 19. The method of claim 16, wherein determining whether the test passphrase is different from the reference passphrase based on the length ratio comprises determining that the test passphrase differs from the reference passphrase in response to determining that the length ratio exceeds a predetermined ratio.
 20. The method of claim 16, wherein comparing the modified set of reference fundamental frequency values to the modified set of test fundamental frequency values comprises: determining a resulting distance measure between the modified set of reference fundamental frequency values and the modified set of test fundamental frequency values to form a matching score; and comparing the matching score to a preconfigured matching threshold to determine whether the speaker of the test speech input matches the speaker of the reference speech input. 